On- and Off-Policy Monotonic Policy Improvement
Authors
Abstract
Monotonic policy improvement and off-policy learning are two key desirable properties for reinforcement learning algorithms. In this study, we show that monotonic policy improvement is guaranteed when learning from a mixture of on- and off-policy data. Based on this theoretical result, we provide an algorithm that uses the experience replay technique for trust region policy optimization. The proposed method can be regarded as a variant of an off-policy natural policy gradient method.
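The abstract describes the algorithm only at a high level. As a minimal illustrative sketch (not the authors' reference implementation), the following PyTorch snippet shows the general shape of a trust-region-style policy update computed on a mixture of on- and off-policy samples drawn from a replay buffer. The importance ratio against the stored behavior log-probabilities, the KL penalty standing in for TRPO's hard constraint, and all names and hyperparameters (GaussianPolicy, surrogate_loss, buffer size, learning rate) are assumptions of the sketch.

import random
from collections import deque

import torch
import torch.nn as nn

# Illustrative sketch, not the paper's reference code: a trust-region-style
# surrogate evaluated on replayed samples. Each buffer entry stores the
# behaviour policy's log-probability so old data can be importance-weighted.
class GaussianPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.mean = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                  nn.Linear(64, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def dist(self, obs):
        return torch.distributions.Normal(self.mean(obs), self.log_std.exp())

def surrogate_loss(policy, obs, act, behav_logp, adv, kl_coef=1.0):
    logp = policy.dist(obs).log_prob(act).sum(-1)
    ratio = (logp - behav_logp).exp()       # pi_theta(a|s) / mu(a|s)
    approx_kl = (behav_logp - logp).mean()  # sample-based KL(mu || pi_theta)
    # The KL penalty stands in for TRPO's hard trust-region constraint.
    return -(ratio * adv).mean() + kl_coef * approx_kl

buffer = deque(maxlen=10_000)  # holds (obs, act, behav_logp, adv) tuples
policy = GaussianPolicy(obs_dim=4, act_dim=2)
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

def update(batch_size=256):
    assert buffer, "fill the buffer with rollouts first"
    # Sampling uniformly from the replay buffer mixes fresh (on-policy)
    # and stale (off-policy) transitions in each update.
    batch = random.sample(buffer, min(batch_size, len(buffer)))
    obs, act, behav_logp, adv = map(torch.stack, zip(*batch))
    loss = surrogate_loss(policy, obs, act, behav_logp, adv)
    opt.zero_grad()
    loss.backward()
    opt.step()

Replacing the penalty with a constrained natural-gradient step would be closer to the trust-region machinery the abstract refers to; the penalty form is used here only to keep the sketch short.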
Similar resources
Point-Based Policy Iteration
We describe a point-based policy iteration (PBPI) algorithm for infinite-horizon POMDPs. PBPI replaces the exact policy improvement step of Hansen’s policy iteration with point-based value iteration (PBVI). Despite being an approximate algorithm, PBPI is monotonic: At each iteration before convergence, PBPI produces a policy for which the values increase for at least one of a finite set of init...
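The truncated abstract omits the backup itself. Below is a minimal NumPy sketch of the standard point-based backup that PBVI performs at each belief point, which PBPI substitutes for exact policy improvement. The discrete-model layout (T[a] as an S-by-S transition matrix, Z[a] as an S-by-O observation matrix, R as an S-by-A reward matrix) and all names are assumptions of the sketch.

import numpy as np

def point_based_backup(b, alphas, T, Z, R, gamma):
    """Best backed-up alpha vector (and greedy action) at belief point b."""
    best_val, best_alpha, best_a = -np.inf, None, None
    n_actions, n_obs = len(T), Z[0].shape[1]
    for a in range(n_actions):
        alpha_a = R[:, a].astype(float)
        for o in range(n_obs):
            # g[k, s] = sum_{s'} T[a][s, s'] * Z[a][s', o] * alphas[k][s']
            g = np.stack([T[a] @ (Z[a][:, o] * alpha) for alpha in alphas])
            # Keep only the candidate that is best at this belief point --
            # this is what makes the backup "point-based".
            alpha_a = alpha_a + gamma * g[np.argmax(g @ b)]
        val = alpha_a @ b
        if val > best_val:
            best_val, best_alpha, best_a = val, alpha_a, a
    return best_alpha, best_a

Applying this backup to every belief in a finite set B and keeping the resulting vectors constitutes one point-based improvement sweep.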
Easy Monotonic Policy Iteration
A key problem in reinforcement learning for control with general function approximators (such as deep neural networks and other nonlinear functions) is that, for many algorithms employed in practice, updates to the policy or Q-function may fail to improve performance—or worse, actually cause the policy performance to degrade. Prior work has addressed this for policy iteration by deriving tight ...
Trust Region Policy Optimization
We describe an iterative procedure for optimizing policies, with guaranteed monotonic improvement. By making several approximations to the theoretically-justified procedure, we develop a practical algorithm, called Trust Region Policy Optimization (TRPO). This algorithm is similar to natural policy gradient methods and is effective for optimizing large nonlinear policies such as neural networks...
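For reference, the practical update in the TRPO paper solves the following constrained surrogate problem, with expectations taken under the current policy and its discounted state visitation:

$$
\max_{\theta}\; \mathbb{E}_{s \sim \rho_{\theta_{\mathrm{old}}},\, a \sim \pi_{\theta_{\mathrm{old}}}}\!\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\mathrm{old}}}(a \mid s)}\, A_{\theta_{\mathrm{old}}}(s, a)\right]
\quad \text{s.t.} \quad
\mathbb{E}_{s \sim \rho_{\theta_{\mathrm{old}}}}\!\left[D_{\mathrm{KL}}\!\big(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s) \,\big\|\, \pi_\theta(\cdot \mid s)\big)\right] \le \delta.
$$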
Safe Policy Iteration
Contributions: (1) Theoretical: we introduce a new, more general lower bound on the improvement of an arbitrary policy over another, based on bounding the distance between their future state distributions. (2) Algorithmic: we define two approximate policy-iteration algorithms whose policy improvement moves toward the estimated greedy policy b...
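A representative bound of this family, due to Kakade and Langford and tightened in the TRPO paper (Safe Policy Iteration's own bound differs in how the state-distribution shift is controlled), lower-bounds the return of a candidate policy by a surrogate minus a divergence penalty:

$$
\eta(\pi') \;\ge\; L_{\pi}(\pi') - \frac{4 \epsilon \gamma}{(1-\gamma)^2}\, D_{\mathrm{KL}}^{\max}(\pi, \pi'),
\qquad \epsilon = \max_{s,a} \big|A_{\pi}(s,a)\big|,
$$

where $L_{\pi}(\pi') = \eta(\pi) + \mathbb{E}_{s \sim \rho_{\pi},\, a \sim \pi'}[A_{\pi}(s,a)]$ is the first-order surrogate. Bounds of this shape guarantee improvement whenever the surrogate gain exceeds the penalty.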
Integral Policy Iterations for Reinforcement Learning Problems in Continuous Time and Space
Policy iteration (PI) is a recursive process of policy evaluation and improvement for solving an optimal decision-making problem, e.g., a reinforcement learning (RL) or optimal control problem, and has served as a foundation for developing RL methods. Motivated by integral PI (IPI) schemes in optimal control and by RL methods in continuous time and space (CTS), this paper proposes on-policy IPI to solve the gene...
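The policy-evaluation step in integral PI schemes typically rests on the interval form of the Bellman equation which, for a cost rate r and a fixed policy u over an interval T > 0 (notation assumed here; discounting is omitted for simplicity), reads

$$
V^{u}\big(x(t)\big) = \int_{t}^{t+T} r\big(x(\tau), u(x(\tau))\big)\, d\tau + V^{u}\big(x(t+T)\big),
$$

so that V^u can be estimated from observed trajectory segments without explicit knowledge of the system dynamics; the improvement step then replaces u with a policy that is greedy with respect to V^u.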
Journal: CoRR
Volume: abs/1710.03442
Pages: -
Publication date: 2017